-
In the financial sphere, there is a wealth of accumulated unstructured data, such as the textual disclosure documents that companies regularly submit to regulatory agencies like the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company’s performance that is not present in quantitative predictors. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). In recent years, there has been great progress in natural language processing via pre-trained language models (LMs) learned from large corpora of textual data. This raises the important question of whether they can be used effectively to produce representations for long documents, and how the effectiveness of representations produced by various LMs can be evaluated. Our work focuses on this question, namely evaluating how well various LMs extract useful soft information from long textual documents for prediction tasks. In this paper, we propose and implement a deep learning evaluation framework that combines a sequential chunking approach with an attention mechanism. We perform an extensive set of experiments on a collection of 10-K reports submitted annually by US banks, and on another dataset of reports submitted by US companies, to investigate thoroughly the performance of different types of language models. Overall, our framework using LMs outperforms strong baseline methods for textual modeling as well as for numerical regression. Our work provides better insights into how pre-trained domain-specific and fine-tuned long-input LMs can improve the quality of long-document representations and thereby improve predictive analyses.
-
Unstructured data, especially text, continues to grow rapidly in various domains. In the financial sphere in particular, there is a wealth of accumulated unstructured data, such as the textual disclosure documents that companies regularly submit to regulatory agencies like the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company's performance. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). Although there has been great progress in pre-trained language models (LMs) that learn from very large corpora of textual data, they still struggle to produce effective representations for long documents. Our work addresses this need, namely how to develop better models that extract useful information from long textual documents and learn effective features that leverage the soft financial and risk information for text regression (prediction) tasks. In this paper, we propose and implement a deep learning framework that splits long documents into chunks and uses pre-trained LMs to process and aggregate the chunks into vector representations, followed by self-attention to extract valuable document-level features. We evaluate our model on a collection of 10-K public disclosure reports from US banks, and on another dataset of reports submitted by US companies. Overall, our framework outperforms strong baseline methods for textual modeling as well as a baseline regression model using only numerical data. Our work provides better insights into how pre-trained domain-specific and fine-tuned long-input LMs can improve the quality of long-document representations and thereby improve predictive analyses.
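The chunk-then-attend idea described in this abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_encoder` is a hypothetical stand-in for a pre-trained LM, the fixed `query` vector replaces learned attention parameters, and the pooling step is a simple single-query attention rather than the self-attention layer the authors describe. It is not the paper's implementation.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def split_into_chunks(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def toy_encoder(chunk: Sequence[str], dim: int = 4) -> Vector:
    """Hypothetical stand-in for a pre-trained LM: normalized character-count features."""
    vec = [0.0] * dim
    for token in chunk:
        for ch in token:
            vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(chunk_vecs: Sequence[Vector], query: Vector) -> Tuple[Vector, List[float]]:
    """Weight each chunk vector by a scaled dot-product score against the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, vec)) / scale for vec in chunk_vecs]
    weights = softmax(scores)
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, chunk_vecs))
        for d in range(len(query))
    ]
    return pooled, weights


def encode_document(
    tokens: Sequence[str],
    chunk_size: int,
    encoder: Callable[[Sequence[str]], Vector] = toy_encoder,
    query: Optional[Vector] = None,
) -> Tuple[Vector, List[float]]:
    """Chunk a document, encode each chunk, and pool the chunks into one vector."""
    chunks = split_into_chunks(tokens, chunk_size)
    if not chunks:
        raise ValueError("document has no tokens")
    chunk_vecs = [encoder(chunk) for chunk in chunks]
    if query is None:
        # Assumed fixed query; a real model would learn this parameter.
        query = [1.0] * len(chunk_vecs[0])
    return attention_pool(chunk_vecs, query)


document = "net interest income rose while credit losses declined".split()
doc_vector, chunk_weights = encode_document(document, chunk_size=3)
```

In a real pipeline the pooled document vector would feed a regression head that predicts the KPI, and the attention weights would indicate which chunks contributed most.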
-
Researchers have used Other Test Method (OTM) 33A to quantify methane emissions from natural gas infrastructure. Historically, errors have been reported for a population of measurements compared to known controlled releases of methane, with 2σ errors of ±70%. However, little research has been performed on the minimum attainable uncertainty of any one measurement. We present two methods of uncertainty estimation. The first is the measurement uncertainty of the state-of-the-art equipment, which was determined to be ±3.8% of the estimate from bootstrapped measurements compared to controlled releases. The second is a modified version of the Hollinger and Richardson (H&R) method, which was developed for quantifying the uncertainty of eddy covariance measurements. Applied to OTM 33A measurements, this method showed that the uncertainty of any given measurement was ±17%. Combining measurement uncertainty with that of stochasticity produced a total minimum uncertainty of 17.4%. Given the current nature of stationary single-sensor measurements and the stochasticity of atmospheric data, such uncertainties will always be present. This is critical for understanding the transport of methane emissions and indirect measurements obtained from the natural gas industry.
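The abstract does not say how the two uncertainties were combined. One common choice for independent error sources is a root-sum-square (quadrature) combination, and that assumption happens to reproduce the reported total. The function name below is hypothetical and the sketch only checks the arithmetic.

```python
import math


def combine_in_quadrature(*relative_uncertainties_pct: float) -> float:
    """Root-sum-square of independent relative uncertainties, in percent.

    Assumes the error sources are independent; the abstract does not
    confirm that this is the combination rule the authors used.
    """
    return math.sqrt(sum(u * u for u in relative_uncertainties_pct))


measurement_pct = 3.8   # equipment measurement uncertainty from the abstract
stochastic_pct = 17.0   # modified H&R estimate from the abstract

total_pct = combine_in_quadrature(measurement_pct, stochastic_pct)
print(f"{total_pct:.1f}%")  # prints 17.4%
```

Under this rule the larger term dominates, which is why the total barely exceeds the ±17% stochastic component.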
-
We tested the hypothesis that carbon dioxide (CO2) uptake fluxes in coastal salt marshes follow ecological similitudes (parameter reductions) and distinct environmental regimes. The hypothesis was evaluated using data collected during May-October 2013 from four salt marshes in Waquoit Bay, MA, USA. Using the dimensional analysis method from fluid mechanics and engineering, we reduced five flux and ecological variables (CO2 uptake, light, soil temperature, salinity, and atmospheric pressure) to two mechanistically meaningful dimensionless groups: (a) a light use efficiency number (LUE = CO2 uptake normalized by daylight) and (b) a biogeochemical number (BGC = interactions among soil temperature, salinity, and atmospheric pressure). Graphical exploration of the dimensionless numbers with the observed data revealed an emergent pattern characterized by distinct high, transitional, and low LUE regimes. Transitions among the identified regimes were dictated by thresholds of soil temperature and salinity. The low LUE regime corresponded to unfavorable environmental conditions (soil temperature < 17 °C and salinity > 30 ppt), whereas the high LUE regime was governed by favorable conditions (soil temperature > 17 °C and salinity < 30 ppt). The identified emergent pattern and environmental thresholds provide key insights into the underlying organizing principles of CO2 uptake and the major environmental drivers in coastal salt marshes.
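The regime thresholds can be expressed as a small classifier. This is a sketch under stated assumptions: it reads the low regime as soil temperature below 17 °C with salinity above 30 ppt and the high regime as the reverse, and it labels every other combination "transitional", which the abstract does not define explicitly. The function and parameter names are hypothetical.

```python
def classify_lue_regime(
    soil_temp_c: float,
    salinity_ppt: float,
    temp_threshold_c: float = 17.0,
    salinity_threshold_ppt: float = 30.0,
) -> str:
    """Return "high", "low", or "transitional" for a single observation."""
    favorable = soil_temp_c > temp_threshold_c and salinity_ppt < salinity_threshold_ppt
    unfavorable = soil_temp_c < temp_threshold_c and salinity_ppt > salinity_threshold_ppt
    if favorable:
        return "high"
    if unfavorable:
        return "low"
    # Assumption: mixed or boundary conditions fall in the transitional regime.
    return "transitional"


print(classify_lue_regime(22.0, 25.0))  # warm and less saline
print(classify_lue_regime(12.0, 33.0))  # cold and more saline
print(classify_lue_regime(22.0, 33.0))  # mixed conditions
```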